The paper investigates the computational problem of predicting RNA secondary
structures. The general belief is that allowing pseudoknots makes the problem hard.
Existing polynomial-time algorithms are heuristic algorithms with no performance
guarantee and can only handle limited types of pseudoknots. In this paper we initiate the
study of predicting RNA secondary structures with a maximum number of stacking
pairs while allowing arbitrary pseudoknots. We obtain two approximation algorithms
with worst-case approximation ratios of 1/2 and 1/3 for planar and general secondary
structures, respectively. For an RNA sequence of $n$ bases, the approximation algorithm
for planar secondary structures runs in $O(n^3)$ time while that for the general case runs
in linear time. Furthermore, we prove that allowing pseudoknots makes it NP-hard to
maximize the number of stacking pairs in a planar secondary structure. This result is in
contrast with the recent NP-hardness results on pseudoknots, which are based on optimizing
some general and complicated energy functions.



1 Introduction 

Ribonucleic acids (RNAs) are molecules that are responsible for regulating many genetic and
metabolic activities in cells. An RNA is single-stranded and can be considered as a sequence
of nucleotides (also known as bases). There are four basic nucleotides, namely, Adenine (A),
Cytosine (C), Guanine (G), and Uracil (U). An RNA folds into a 3-dimensional structure
by forming pairs of bases. Paired bases tend to stabilize the RNA (i.e., have negative free
energy). Yet base pairing does not occur arbitrarily. In particular, A-U and C-G form
stable pairs and are known as the Watson-Crick base pairs. Other base pairings are less
stable and often ignored. An example of a folded RNA is shown in Figure 1. Note that
this figure is just schematic; in practice, RNAs are 3-dimensional molecules.

The 3-dimensional structure is related to the function of the RNA. Yet existing
experimental techniques for determining the 3-dimensional structures of RNAs are often very
costly and time consuming (see, e.g., [?]). The secondary structure of an RNA is the set
of base pairings formed in its 3-dimensional structure. To determine the 3-dimensional
structure of a given RNA sequence, it is useful to determine the corresponding secondary
structure. As a result, it is important to design efficient algorithms to predict the secondary
structure with computers.

From a computational viewpoint, the challenge of the RNA secondary structure prediction
problem arises from some special structures called pseudoknots, which are defined
as follows. Let $S$ be an RNA sequence $s_1, s_2, \ldots, s_n$. A pseudoknot is composed of two
interleaving base pairs, i.e., $(s_i, s_j)$ and $(s_k, s_l)$ such that $i < k < j < l$. See Figure 2 for
examples.
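
The interleaving condition can be checked mechanically. The following is a minimal sketch, assuming a secondary structure is given as a list of index pairs in which no base occurs twice; the function name `has_pseudoknot` and this representation are illustrative choices, not notation from the paper.

```python
def has_pseudoknot(pairs):
    """Return True if two base pairs (i, j) and (k, l) satisfy i < k < j < l."""
    # Normalise each pair so that the smaller index comes first, then sort by it.
    ordered = sorted((min(a, b), max(a, b)) for a, b in pairs)
    for x in range(len(ordered)):
        i, j = ordered[x]
        for y in range(x + 1, len(ordered)):
            k, l = ordered[y]
            # Sorting guarantees i < k, so only k < j < l remains to be tested.
            if k < j < l:
                return True
    return False
```

Nested pairs such as (1, 8) and (3, 5), or side-by-side pairs such as (1, 3) and (4, 8), are not reported, since neither pattern interleaves.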

If we assume that the secondary structure of an RNA contains no pseudoknots, the
secondary structure can be decomposed into a few types of loops: stacking pairs, hairpins,
bulges, internal loops, and multiple loops (see, e.g., Tompa's lecture notes [?] or Waterman's
book [?]). A stacking pair is a loop formed by two pairs of consecutive bases $(s_i, s_j)$ and
$(s_{i+1}, s_{j-1})$ with $i + 4 \le j$. See Figure 1 for an example. By definition, a stacking pair
contains no unpaired bases and any other kinds of loops contain one or more unpaired bases.
Since unpaired bases are destabilizing and have positive free energy, stacking pairs are the
only type of loops that have negative free energy and stabilize the secondary structure. It
is also natural to assume that the free energies of loops are independent. Then an optimal
pseudoknot-free secondary structure can be computed using dynamic programming in $O(n^3)$
time [?].
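
To make the objective concrete, the sketch below counts the stacking pairs of a given set of base pairs. It assumes the structure is a collection of index pairs and reads the side condition as $i + 4 \le j$; both the representation and the name `count_stacking_pairs` are assumptions made for illustration.

```python
def count_stacking_pairs(pairs):
    """Count pairs (i, j) whose neighbour (i + 1, j - 1) is also a base pair."""
    structure = {(min(a, b), max(a, b)) for a, b in pairs}
    return sum(
        1
        for (i, j) in structure
        if j - i >= 4 and (i + 1, j - 1) in structure
    )
```

A helix of three nested pairs, for instance (1, 9), (2, 8), (3, 7), contributes two stacking pairs, while an isolated base pair contributes none.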

However, pseudoknots are known to exist in some RNAs. For predicting secondary
structures with pseudoknots, Nussinov et al. [?] have studied the case where the energy
function is minimized when the number of base pairs is maximized and have obtained an
$O(n^3)$-time algorithm for predicting secondary structures. Based on some special energy
functions, Lyngsø and Pedersen [?] have proven that determining the optimal secondary
structure possibly with pseudoknots is NP-hard. Akutsu [?] has shown that it is NP-hard
to determine an optimal planar secondary structure, where a secondary structure is planar
if the graph formed by the base pairings and the backbone connections of adjacent bases
is planar (see Section 2 for a more detailed definition). Rivas and Eddy [?], Uemura et al.
[?], and Akutsu [?] have also proposed polynomial-time algorithms that can handle limited
types of pseudoknots; note that the exact types of such pseudoknots are implicit in these
algorithms and difficult to determine.

Figure 2: Examples of pseudoknots

Although it might be desirable to have a better classification of pseudoknots and better
algorithms that can handle a wider class of pseudoknots, this paper approaches the problem
in a different general direction. We initiate the study of predicting RNA secondary
structures that allow arbitrary pseudoknots while maximizing the number of stacking pairs.
Such a simple energy function is meaningful as stacking pairs are the only loops that
stabilize secondary structures. We obtain two approximation algorithms with worst-case ratios
of 1/2 and 1/3 for planar and general secondary structures, respectively. The planar
approximation algorithm makes use of a geometric observation that allows us to visualize the
planarity of stacking pairs on a rectangular grid; interestingly, such an observation does not
hold if our aim is to maximize the number of base pairs. This algorithm runs in $O(n^3)$ time.
The second approximation algorithm is more complicated and is based on a combination of
multiple "greedy" strategies. A straightforward analysis cannot lead to the approximation
ratio of 1/3. We make use of amortization over different steps to obtain the desired ratio.
This algorithm runs in $O(n)$ time.

To complement these two algorithms, we also prove that allowing pseudoknots makes 
it NP-hard to find the planar secondary structure with the largest number of stacking 
pairs. The proof makes use of a reduction from a well-known NP-complete problem called 
Tripartite Matching [?]. This result indicates that the hardness of the RNA secondary
structure prediction problem may be inherent in the pseudoknot structures and may not 
be necessarily due to the complication of the energy functions. This is in contrast to the 
other NP-hardness results discussed earlier. 

The rest of this paper is organized into five sections. Section 2 discusses some basic
properties. Sections 3 and 4 present the approximation algorithms for planar and general 
secondary structures, respectively. Section 5 details the NP-hardness result. Section 6 
concludes the paper with open problems. 

2 Preliminaries

Let $S = s_1 s_2 \cdots s_n$ be an RNA sequence of $n$ bases. A secondary structure $\mathcal{P}$ of $S$ is a set of
Watson-Crick pairs $(s_{i_1}, s_{j_1}), \ldots, (s_{i_p}, s_{j_p})$, where $i_r + 2 \le j_r$ for all $r = 1, \ldots, p$ and no
two pairs share a base. We denote $q$ ($q \ge 1$) consecutive stacking pairs $(s_i, s_j), (s_{i+1}, s_{j-1})$;
$(s_{i+1}, s_{j-1}), (s_{i+2}, s_{j-2})$; $\ldots$; $(s_{i+q-1}, s_{j-q+1}), (s_{i+q}, s_{j-q})$ of $\mathcal{P}$ by
$(s_i, s_{i+1}, \ldots, s_{i+q}; s_{j-q}, \ldots, s_{j-1}, s_j)$.

Definition 1 Given a secondary structure $\mathcal{P}$, we define an undirected graph $G(\mathcal{P})$ such
that the bases of $S$ are the nodes of $G(\mathcal{P})$ and $(s_i, s_j)$ is an edge of $G(\mathcal{P})$ if $j = i + 1$ or
$(s_i, s_j)$ is a base pair in $\mathcal{P}$.

Definition 2 A secondary structure $\mathcal{P}$ is planar if $G(\mathcal{P})$ is a planar graph.

Definition 3 A secondary structure $\mathcal{P}$ is said to contain an interleaving block if $\mathcal{P}$ contains
three stacking pairs $(s_i, s_{i+1}; s_{j-1}, s_j)$, $(s_{i'}, s_{i'+1}; s_{j'-1}, s_{j'})$, $(s_{i''}, s_{i''+1}; s_{j''-1}, s_{j''})$ where
$i < i' < i'' < j < j' < j''$.

Lemma 2.1 If a secondary structure $\mathcal{P}$ contains an interleaving block, $\mathcal{P}$ is non-planar.

Proof. Suppose $\mathcal{P}$ contains an interleaving block. Without loss of generality, we assume
that $\mathcal{P}$ contains the stacking pairs $(s_1, s_2; s_7, s_8)$, $(s_3, s_4; s_9, s_{10})$, and $(s_5, s_6; s_{11}, s_{12})$. Figure
3(a) shows the subgraph of $G(\mathcal{P})$ corresponding to these stacking pairs. Since this subgraph
contains a homeomorphic copy of $K_{3,3}$ (see Figure 3(b)), $G(\mathcal{P})$ and $\mathcal{P}$ are non-planar. □
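
Definition 3 can be tested directly on a structure. The sketch below is a brute-force check, assuming the structure is a list of index pairs; a stacking pair $(s_i, s_{i+1}; s_{j-1}, s_j)$ is represented by its outer pair $(i, j)$. The function names are hypothetical and the cubic enumeration is chosen for clarity, not efficiency.

```python
from itertools import combinations


def stacking_pairs(pairs):
    """Outer pairs (i, j) such that (i + 1, j - 1) is also a base pair."""
    structure = {(min(a, b), max(a, b)) for a, b in pairs}
    return sorted(
        (i, j) for (i, j) in structure if j - i >= 4 and (i + 1, j - 1) in structure
    )


def has_interleaving_block(pairs):
    """True if three stacking pairs satisfy i < i' < i'' < j < j' < j''."""
    # combinations() preserves the sorted order, so the left indices are increasing.
    for (i, j), (i2, j2), (i3, j3) in combinations(stacking_pairs(pairs), 3):
        if i < i2 < i3 < j < j2 < j3:
            return True
    return False
```

The structure used in the proof of Lemma 2.1, with base pairs (1, 8), (2, 7), (3, 10), (4, 9), (5, 12), (6, 11), is reported as containing an interleaving block, whereas a single nested helix is not.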




Figure 3: Interleaving block 

3 An Approximation Algorithm for Planar Secondary Structures

We present an algorithm which, given an RNA sequence $S = s_1 s_2 \cdots s_n$, constructs a planar
secondary structure of $S$ to approximate one with the maximum number of stacking pairs
with a ratio of at least 1/2. This approximation algorithm is based on the subtle observation
in Lemma 3.1 that if a secondary structure $\mathcal{P}$ is planar, the subgraph of $G(\mathcal{P})$ which contains
only the stacking pairs of $\mathcal{P}$ can be embedded in a grid with a useful property. This property
enables us to consider only the secondary structures of $S$ without pseudoknots in order to
achieve a 1/2 approximation ratio.

Definition 4 Given a secondary structure $\mathcal{P}$, we define a stacking pair embedding of $\mathcal{P}$
on a grid as follows. Represent the bases of $S$ as $n$ consecutive grid points on the same
horizontal grid line $L$ such that $s_i$ and $s_{i+1}$ ($1 \le i < n$) are connected directly by a horizontal
grid edge. If $(s_i, s_{i+1}; s_{j-1}, s_j)$ is a stacking pair of $\mathcal{P}$, $s_i$ and $s_{i+1}$ are connected to $s_j$ and
$s_{j-1}$ respectively by a sequence of grid edges such that the two sequences must be either both
above or both below $L$.

Figure 4 shows a stacking pair embedding (Figure 4(b)) of a given secondary structure
(Figure 4(a)). Note that $(s_3, s_9)$ does not form a stacking pair with any other base pair, so $s_3$ is
not connected to $s_9$ in the stacking pair embedding. Similarly, $s_4$ is not connected to $s_{10}$
in the embedding.

Figure 4: An example of a stacking pair embedding 



Definition 5 A stacking pair embedding is said to be planar if it can be drawn in such a 
way that no lines cross or overlap with each other in the grid. 

The embedding shown in Figure 4(b) is planar.

Lemma 3.1 Let $\mathcal{P}$ be a secondary structure of an RNA sequence $S$. Let $E$ be a stacking
pair embedding of $\mathcal{P}$. If $\mathcal{P}$ is planar, then $E$ must be planar.

Proof. If $\mathcal{P}$ does not have a planar stacking pair embedding, we claim that $\mathcal{P}$ contains
an interleaving block. Let $L$ be the horizontal grid line that contains the bases of $S$ in $E$.
Since $\mathcal{P}$ does not have a planar stacking pair embedding, we can assume that $E$ has two
stacking pairs that intersect above $L$ (see Figure 5(a)).



Figure 5: Non-planar stacking pair embedding



If there is no other stacking pair underneath these two pairs, we can flip one of the pairs
below $L$ as shown in Figure 5(b). So, there must be at least one stacking pair underneath
these two pairs. By checking all possible cases (all non-symmetric cases are shown in Figures
5(c) to (i)), it can be shown that $E$ cannot be redrawn without crossing or overlapping lines
only if it contains an interleaving block (Figures 5(h) and (i)). So, by Lemma 2.1, $\mathcal{P}$ is
non-planar. □

By Lemma 3.1, we can relate two secondary structures having the maximum number of
stacking pairs with and without pseudoknots in the following lemma.

Lemma 3.2 Given an RNA sequence $S$, let $N^*$ be the maximum number of stacking pairs
that can be formed by a planar secondary structure of $S$ and let $W$ be the maximum number
of stacking pairs that can be formed by $S$ without pseudoknots. Then, $W \ge N^*/2$.

Proof. Let $\mathcal{P}^*$ be a planar secondary structure of $S$ with $N^*$ stacking pairs. Since $\mathcal{P}^*$
is planar, by Lemma 3.1, any stacking pair embedding of $\mathcal{P}^*$ is planar.

Let $E$ be a stacking pair embedding of $\mathcal{P}^*$ such that no lines cross each other in the grid.
Let $L$ be the horizontal grid line of $E$ which contains all bases of $S$. Let $n_1$ and $n_2$ be the
number of stacking pairs which are drawn above and below $L$, respectively. Without loss
of generality, assume that $n_1 \ge n_2$. Now, we construct another planar secondary structure
$\mathcal{P}'$ from $E$ by deleting all stacking pairs which are drawn below $L$. Obviously, $\mathcal{P}'$ is a planar
secondary structure of $S$ without pseudoknots. Since $n_1 \ge n_2$ and $n_1 + n_2 = N^*$, we have
$n_1 \ge N^*/2$. As $W \ge n_1$, $W \ge N^*/2$. □

Based on Lemma 3.2, we now present the dynamic programming algorithm MaxSP
which computes the maximum number of stacking pairs that can be formed by an RNA
sequence $S = s_1 s_2 \cdots s_n$ without pseudoknots.

Algorithm MaxSP

Define $V(i, j)$ (for $j \ge i$) as the maximum number of stacking pairs without pseudoknots
that can be formed by $s_i \ldots s_j$ if $s_i$ and $s_j$ form a Watson-Crick pair. Let $W(i, j)$ ($j \ge i$) be
the maximum number of stacking pairs without pseudoknots that can be formed by $s_i \ldots s_j$.
Obviously, $W(1, n)$ gives the maximum number of stacking pairs that can be formed by $S$
without pseudoknots.



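Basis:

For $j = i, i + 1, i + 2$, or $i + 3$ ($j \le n$),

$$V(i, j) = 0 \ \text{ if } s_i, s_j \text{ form a Watson-Crick pair}; \qquad W(i, j) = 0.$$

Recurrence:

For $j > i + 3$,

$$W(i, j) = \max \begin{cases} V(i, j) & \text{if } s_i, s_j \text{ form a Watson-Crick pair} \\ W(i + 1, j) \\ W(i, j - 1) \end{cases}$$

$$V(i, j) = \max \begin{cases} V(i + 1, j - 1) + 1 & \text{if } s_{i+1}, s_{j-1} \text{ form a Watson-Crick pair} \\ \max_{i+1 \le k \le j-2} \{ W(i + 1, k) + W(k + 1, j - 1) \} \end{cases}$$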
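
The following is a minimal Python sketch of a dynamic program in the spirit of MaxSP, using 0-based indices. It keeps the basis and the recurrence for $V$ as stated, with one deviation that is an assumption of this sketch and not part of the recurrence above: $W$ additionally tries every split point, so that two helices lying side by side without an enclosing base pair are both counted. The names `max_sp` and `PAIRS` are illustrative.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def max_sp(seq):
    """Maximum number of stacking pairs in a pseudoknot-free structure of seq."""
    n = len(seq)
    if n == 0:
        return 0
    neg = float("-inf")
    # V[i][j]: best count inside seq[i..j] when seq[i] pairs with seq[j];
    # it stays at -inf when the two bases are not complementary.
    V = [[neg] * n for _ in range(n)]
    # W[i][j]: best count inside seq[i..j] with no condition on the end bases.
    W = [[0] * n for _ in range(n)]
    for span in range(n):  # span = j - i, processed in increasing order
        for i in range(n - span):
            j = i + span
            pairable = (seq[i], seq[j]) in PAIRS
            if span <= 3:  # basis: no stacking pair fits in so short an interval
                if pairable:
                    V[i][j] = 0
                continue
            if pairable:
                # Split the interior of the pair (i, j) into two independent parts.
                best = max(W[i + 1][k] + W[k + 1][j - 1] for k in range(i + 1, j - 1))
                # Or stack the pair (i, j) directly on the pair (i + 1, j - 1).
                if (seq[i + 1], seq[j - 1]) in PAIRS:
                    best = max(best, V[i + 1][j - 1] + 1)
                V[i][j] = best
            # Assumption of this sketch: W tries every split point. The splits
            # k = i and k = j - 1 reproduce the terms W(i+1, j) and W(i, j-1).
            best_w = max(W[i][k] + W[k + 1][j] for k in range(i, j))
            if pairable:
                best_w = max(best_w, V[i][j])
            W[i][j] = best_w
    return W[0][n - 1]
```

For the hairpin `AAAGUUU` the sketch finds three nested A-U pairs and hence two stacking pairs; for `AAAGUUUCCCAGGG` it finds two such helices side by side. Both tables have $O(n^2)$ entries and each entry inspects $O(n)$ values, so the sketch stays within $O(n^3)$ time and $O(n^2)$ space.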



Lemma 3.3 Given an RNA sequence $S$ of length $n$, Algorithm MaxSP computes the
maximum number of stacking pairs that can be formed by $S$ without pseudoknots in $O(n^3)$
time and $O(n^2)$ space.

Proof. There are $O(n^2)$ entries $V(i, j)$ and $W(i, j)$ to be filled. To fill an entry of
$V(i, j)$, we check at most $O(n)$ values. To fill an entry of $W(i, j)$, $O(1)$ time suffices. The
total time complexity for filling all entries is $O(n^3)$. Storing all entries requires $O(n^2)$ space.
□

Although Algorithm MaxSP as presented above only computes the number of
stacking pairs, it can be easily modified to compute the secondary structure. Thus we have
the following theorem.

Theorem 3.4 The Algorithm MaxSP is a (1/2)-approximation algorithm for the problem
of constructing a planar secondary structure which maximizes the number of stacking pairs for
an RNA sequence $S$.

4 An Approximation Algorithm for General Secondary Structures

We present Algorithm GreedySP() which, given an RNA sequence $S = s_1 s_2 \cdots s_n$,
constructs a secondary structure of $S$ (not necessarily planar) with at least 1/3 of the maximum
possible number of stacking pairs. The approximation algorithm uses a greedy approach.
Figure 6 shows the algorithm GreedySP().

// Let $S = s_1 s_2 \cdots s_n$ be the input RNA sequence. Initially, all $s_j$ are unmarked.
// Let $E$ be the set of base pairs output by the algorithm. Initially, $E = \emptyset$.

GreedySP($S$, $i$)    // $i \ge 3$

1. Repeatedly find the leftmost $i$ consecutive stacking pairs $SP$ (i.e., find
$(s_p, \ldots, s_{p+i}; s_{q-i}, \ldots, s_q)$ such that $p$ is as small as possible) formed by
unmarked bases. Add $SP$ to $E$ and mark all these bases.

2. For $k = i - 1$ downto 2,

Repeatedly find any $k$ consecutive stacking pairs $SP$ formed by unmarked bases.
Add $SP$ to $E$ and mark all these bases.

3. Repeatedly find the leftmost stacking pair $SP$ formed by unmarked bases. Add
$SP$ to $E$ and mark all these bases.

Figure 6: A 1/3-Approximation Algorithm
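
The three steps of Figure 6 can be sketched in Python as follows. This is a naive quadratic-time search meant only to illustrate the order in which runs are selected; it is not the linear-time implementation discussed later. It uses 0-based indices, always takes the leftmost run (which is allowed in Step 2, where any run may be chosen), and assumes that the innermost pair of a run must leave at least one base between its two ends. The names `find_run` and `greedy_sp` are hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def find_run(seq, marked, k):
    """Leftmost (p, q) giving k consecutive stacking pairs on unmarked bases."""
    n = len(seq)
    for p in range(n):
        # The run pairs seq[p + t] with seq[q - t] for t = 0..k; the innermost
        # pair (p + k, q - k) must satisfy (p + k) + 2 <= q - k.
        for q in range(p + 2 * k + 2, n):
            if all(
                not marked[p + t]
                and not marked[q - t]
                and (seq[p + t], seq[q - t]) in PAIRS
                for t in range(k + 1)
            ):
                return p, q
    return None


def greedy_sp(seq, i=3):
    """Return (base pairs chosen, number of stacking pairs credited to the runs)."""
    marked = [False] * len(seq)
    chosen, credited = [], 0
    # Step 1 uses run length i, Step 2 lengths i - 1 down to 2, Step 3 length 1.
    for k in list(range(i, 1, -1)) + [1]:
        while True:
            hit = find_run(seq, marked, k)
            if hit is None:
                break
            p, q = hit
            for t in range(k + 1):
                marked[p + t] = marked[q - t] = True
                chosen.append((p + t, q - t))
            credited += k
    return chosen, credited
```

The second return value is the sum of the run lengths, which corresponds to the quantity $N$ used in the analysis below.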

In the following, we analyze the approximation ratio of this algorithm. The algorithm
GreedySP($S$, $i$) will generate a sequence of $SP$'s denoted by $SP_1, SP_2, \ldots, SP_h$.

Fact 4.1 For any $SP_j$ and $SP_k$ ($j \ne k$), the stacking pairs in $SP_j$ do not share any base
with those in $SP_k$.

For each $SP_j = (s_p, \ldots, s_{p+t}; s_{q-t}, \ldots, s_q)$, we define two intervals of indexes, $I_j$ and
$J_j$, as $[p..p + t]$ and $[q - t..q]$, respectively. In order to compare the number of stacking pairs
formed with that in the optimal case, we have the following definition.

Definition 6 Let $\mathcal{P}$ be an optimal secondary structure of $S$ with the maximum number
of stacking pairs. Let $T$ be the set of all stacking pairs of $\mathcal{P}$. For each $SP_j$ computed by
GreedySP($S$, $i$) and $\beta = I_j$ or $J_j$,

let $X_\beta = \{ (s_k, s_{k+1}; s_{w-1}, s_w) \in T \mid$ at least one of indexes $k, k + 1, w - 1, w$ is in $\beta \}$.

Note that the $X_\beta$'s may not be disjoint.

Lemma 4.2 $\bigcup_{1 \le j \le h} (X_{I_j} \cup X_{J_j}) = T$.

Proof. We prove this lemma by contradiction. Suppose that there exists a stacking
pair $(s_k, s_{k+1}; s_{w-1}, s_w)$ in $T$ but not in any of $X_{I_j}$ and $X_{J_j}$. By Definition 6, none of the
indexes $k, k + 1, w - 1, w$ is in any of $I_j$ and $J_j$. This contradicts Step 3 of Algorithm
GreedySP($S$, $i$). □



Definition 7 For each $j$,

let $X'_{I_j} = X_{I_j} - \bigcup_{k<j} \{X_{I_k} \cup X_{J_k}\}$, and let $X'_{J_j} = X_{J_j} - \bigcup_{k<j} \{X_{I_k} \cup X_{J_k}\} - X_{I_j}$.

Let $|SP_j|$ be the number of stacking pairs represented by $SP_j$. Let $|I_j|$ and $|J_j|$ be the
numbers of indexes in the intervals $I_j$ and $J_j$, respectively.

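Lemma 4.3 Let $N$ be the number of stacking pairs computed by Algorithm GreedySP($S$, $i$)
and $N^*$ be the maximum number of stacking pairs that can be formed by $S$. If for all $j$, we
have $|SP_j| \ge \frac{1}{3} \times |X'_{I_j} \cup X'_{J_j}|$, then $N \ge \frac{1}{3} \times N^*$.

Proof. By Definition 7, $\bigcup_k (X'_{I_k} \cup X'_{J_k}) = \bigcup_k (X_{I_k} \cup X_{J_k})$. Then by Fact 4.1, $N = \sum_j |SP_j|$.
Thus, $N \ge \frac{1}{3} \times \sum_j |X'_{I_j} \cup X'_{J_j}| \ge \frac{1}{3} \times |\bigcup_k (X_{I_k} \cup X_{J_k})|$. By Lemma 4.2, $N \ge \frac{1}{3} \times N^*$. □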

Lemma 4.4 For each $SP_j$ computed by GreedySP($S$, $i$), we have $|SP_j| \ge \frac{1}{3} \times |X'_{I_j} \cup X'_{J_j}|$.

Proof. There are three cases as follows.

Case 1: $SP_j$ is computed by GreedySP($S$, $i$) in Step 1. Note that $SP_j = (s_p, \ldots, s_{p+i};
s_{q-i}, \ldots, s_q)$ is the leftmost $i$ consecutive stacking pairs, i.e., $p$ is the smallest possible. By
definition, $|X'_{I_j}|, |X'_{J_j}| \le i + 2$. We further claim that $|X'_{I_j}| \le i + 1$. Then
$|SP_j| / |X'_{I_j} \cup X'_{J_j}| \ge i / ((i + 1) + (i + 2)) \ge 1/3$ (as $i \ge 3$).

We prove the claim by contradiction. Assume that $|X'_{I_j}| = i + 2$. That is, for some
integer $t$, $\mathcal{P}$ has $i + 2$ consecutive stacking pairs $(s_{p-1}, \ldots, s_{p+i+1}; s_{t-i-1}, \ldots, s_{t+1})$.
Furthermore, none of the bases $s_{p-1}, \ldots, s_{p+i+1}, s_{t-i-1}, \ldots, s_{t+1}$ are marked before $SP_j$ is chosen;
otherwise, suppose one such base, say $s_a$, is marked when the algorithm chooses $SP_l$ for
$l < j$; then a stacking pair adjacent to $s_a$ does not belong to $X'_{I_j}$ and belongs to $X_{I_l}$
or $X_{J_l}$ instead. Therefore, $(s_{p-1}, \ldots, s_{p+i-1}; s_{t-i+1}, \ldots, s_{t+1})$ is the leftmost $i$ consecutive
stacking pairs formed by unmarked bases before $SP_j$ is chosen. As $SP_j$ is not the leftmost $i$
consecutive stacking pairs, this contradicts the selection criteria of $SP_j$. The claim follows.

Case 2: $SP_j$ is computed by GreedySP($S$, $i$) in Step 2. Let $|SP_j| = k \ge 2$. Let $SP_j =
(s_p, \ldots, s_{p+k}; s_{q-k}, \ldots, s_q)$. By definition, $|X'_{I_j}|, |X'_{J_j}| \le k + 2$. We claim that
$|X'_{I_j}|, |X'_{J_j}| \le k + 1$. Then $|SP_j| / |X'_{I_j} \cup X'_{J_j}| \ge k / ((k + 1) + (k + 1))$, which is at least 1/3 as $k \ge 2$.

To show that $|X'_{I_j}| \le k + 1$ by contradiction, assume $|X'_{I_j}| = k + 2$. Thus, for some
integer $t$, there exist $k + 2$ consecutive stacking pairs $(s_{p-1}, \ldots, s_{p+k+1}; s_{t-k-1}, \ldots, s_{t+1})$.
Similarly to Case 1, we can show that none of the bases $s_{p-1}, \ldots, s_{p+k+1}, s_{t-k-1}, \ldots, s_{t+1}$
are marked before $SP_j$ is chosen. Thus, GreedySP($S$, $i$) should select some $k + 1$ or $k + 2$
consecutive stacking pairs instead of the chosen $k$ consecutive stacking pairs, reaching a
contradiction. Similarly, we can show $|X'_{J_j}| \le k + 1$.

Case 3: $SP_j$ is computed by GreedySP($S$, $i$) in Step 3. $SP_j$ is the leftmost stacking pair
when it is chosen. Let $SP_j = (s_p, s_{p+1}; s_{q-1}, s_q)$. By the same approach as in Case 2, we can
show $|X'_{I_j}|, |X'_{J_j}| \le 2$. We further claim $|X'_{I_j}| \le 1$. Then $|SP_j| / |X'_{I_j} \cup X'_{J_j}| \ge 1/(1 + 2) = 1/3$.

To verify $|X'_{I_j}| \le 1$, we consider all possible cases with $|X'_{I_j}| = 2$ while there are no
two consecutive stacking pairs. The only possible case is that for some integers $r, t$, both
$(s_{p-1}, s_p; s_{r-1}, s_r)$ and $(s_p, s_{p+1}; s_{t-1}, s_t)$ belong to $X'_{I_j}$. Then, $SP_j$ cannot be the leftmost
stacking pair formed by unmarked bases, contradicting the selection criteria of $SP_j$. □

Theorem 4.5 Let $S$ be an RNA sequence. Let $N^*$ be the maximum number of stacking
pairs that can be formed by any secondary structure of $S$. Let $N$ be the number of stacking
pairs output by GreedySP($S$, $i$). Then, $N \ge N^*/3$.

Proof. By Lemmas 4.3 and 4.4, the result follows. □

We remark that by setting $i = 3$ in GreedySP($S$, $i$), we can already achieve the
approximation ratio of 1/3. The following theorem gives the time and space complexity of the
algorithm.

Theorem 4.6 Given an RNA sequence $S$ of length $n$ and a constant $k$, Algorithm
GreedySP($S$, $k$) can be implemented in $O(n)$ time and $O(n)$ space.

Proof. Recall that the bases of an RNA sequence are chosen from the alphabet
$\{A, U, G, C\}$. If $k$ is a constant, there are only a constant number of different patterns of
consecutive stacking pairs that we must consider. For any $1 \le j \le k$, there are only $4^j$
different strings that can be formed by the four characters $\{A, U, G, C\}$. So, the locations of
the occurrences of these possible strings in the RNA sequence can be recorded in an array
of linked lists indexed by the pattern of the string using $O(n)$ time preprocessing. There
are at most $4^j$ linked lists for any fixed $j$ and there are at most $n$ entries in these linked
lists. In total, there are at most $kn$ entries in all linked lists for all possible values of $j$.

Now, we fix a constant $j$. To locate all $j$ consecutive stacking pairs, we scan the RNA
sequence from left to right. For each substring of $j$ consecutive characters, we look up the
array to see whether we can form $j$ consecutive stacking pairs. By simple bookkeeping,
we can keep track of which bases have been used already. Each entry in the linked lists will
be scanned at most once, so the whole procedure takes only $O(n)$ time. Since $k$ is a
constant, we can repeat the whole procedure for $k$ different values of $j$, and the total time
complexity is still $O(n)$. □
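
The pattern-indexed table used in this proof can be sketched as below. A dictionary of position lists stands in for the array of linked lists, and the bookkeeping of marked bases is omitted, so this illustrates only the preprocessing and lookup, not the full linear-time procedure. The names `build_index` and `partner_positions`, and the requirement that the partner block start at least one base after the pattern ends, are assumptions of this sketch.

```python
from collections import defaultdict

COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def build_index(seq, length):
    """Map every substring of the given length to its start positions, left to right."""
    index = defaultdict(list)
    for p in range(len(seq) - length + 1):
        index[seq[p:p + length]].append(p)
    return index


def partner_positions(seq, index, p, length):
    """Start positions of blocks to the right that can pair base-by-base with seq[p:p+length]."""
    pattern = seq[p:p + length]
    # The partner block must read as the reversed, complemented pattern.
    wanted = "".join(COMPLEMENT[b] for b in reversed(pattern))
    # Leave at least one unpaired base between the pattern and its partner block.
    return [q for q in index.get(wanted, []) if q >= p + length + 1]
```

For a constant `length`, building the table takes one left-to-right pass over the sequence, which matches the $O(n)$ preprocessing claimed in the proof.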

5 NP-completeness 

In this section, we show that it is NP-hard to find a planar secondary structure with the
largest number of stacking pairs. We consider the following decision problem. Given an
RNA sequence $S$ and an integer $h$, we wish to determine whether the largest possible
number of stacking pairs in a planar secondary structure of $S$, denoted $sp(S)$, is at least
$h$. Below we show that this decision problem is NP-complete by reducing the tripartite
matching problem [?] to it, which is defined as follows.

Given three node sets $X$, $Y$, and $Z$ with the same cardinality $n$ and an edge set $E \subseteq
X \times Y \times Z$ of size $m$, the tripartite matching problem is to determine whether $E$ contains
a perfect matching, i.e., a set of $n$ edges which touches every node of $X$, $Y$, and $Z$ exactly
once.
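
For small instances the definition can be checked by exhaustive search. The sketch below assumes that the nodes of each set are numbered $0, \ldots, n-1$ and that each edge is a triple of such numbers; it runs in exponential time and is intended only to pin down what a perfect matching is. The name `has_perfect_matching` is illustrative.

```python
from itertools import combinations


def has_perfect_matching(n, edges):
    """True if some n edges touch every node of X, Y and Z exactly once."""
    for subset in combinations(edges, n):
        xs = {e[0] for e in subset}
        ys = {e[1] for e in subset}
        zs = {e[2] for e in subset}
        # n edges cover n distinct nodes in each set exactly when no node repeats.
        if len(xs) == n and len(ys) == n and len(zs) == n:
            return True
    return False
```

For example, with $n = 2$ the edge set $\{(0,0,1), (0,1,0), (1,1,0)\}$ contains the perfect matching $\{(0,0,1), (1,1,0)\}$.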



The remainder of this section is organized as follows. Section 5.1 shows how we construct
in polynomial time an RNA sequence $S_E$ and an integer $h$ from a given instance $(X, Y, Z, E)$
of the tripartite matching problem, where $h$ depends on $n$ and $m$. Section 5.2 shows that
if $E$ contains a perfect matching, then $sp(S_E) \ge h$. Section 5.3 is the non-trivial part,
showing that if $E$ does not contain a perfect matching, then $sp(S_E) < h$. Combining these
three sections, we can conclude that it is NP-hard to maximize the number of stacking pairs
for planar RNA secondary structures.



5.1 Construction of the RNA sequence $S_E$

Consider any instance $(X, Y, Z, E)$ of the tripartite matching problem. We construct an
RNA sequence $S_E$ and an integer $h$ as follows. Let $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$,
and $Z = \{z_1, \ldots, z_n\}$. Furthermore, let $E = \{e_1, e_2, \ldots, e_m\}$, where each edge $e_j =
(x_{p_j}, y_{q_j}, z_{r_j})$. Recall that an RNA sequence contains characters chosen from the alphabet
$\{A, U, G, C\}$. Below we denote $A^i$, where $i$ is any positive integer, as the sequence of $i$ $A$'s.
Furthermore, $A^+$ means a sequence of one or more $A$'s.

Let $d = \max\{6n, 4(m + 1)\} + 1$. Define the following four RNA sequences for every
positive integer $k < d$.

• 5{k) is the sequence U'^A^GU'^A'^-'', and 6(k) is the sequence U'^-'^A'^GU'^A'^. 

• 7r{k) is the sequence c'^d+2k j^Q(jAd~2k ^ ^^^j -g ^-^^ sequence G^'^'^^ AG^'^+^^ . 

Fragments: Note that the sequences $\delta(k)$ and $\bar{\delta}(k)$ are each composed of two substrings
in the form of $U^+A^+$, separated by a character $G$. Each of these two substrings is called
a fragment. Similarly, the two substrings of the form $C^+$ separated by $AG$ in $\pi(k)$ and
the two substrings of the form $G^+$ separated by the character $A$ in $\bar{\pi}(k)$ are also called
fragments.

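Node Encoding: Each node in the three node sets $X$, $Y$, and $Z$ is associated with
a unique sequence. For $1 \le i \le n$, let $\langle x_i \rangle$, $\langle y_i \rangle$, $\langle z_i \rangle$ denote the sequences $\delta(i)$, $\delta(n + i)$,
$\delta(2n + i)$, respectively. Intuitively, $\langle x_i \rangle$ is the encoding of the node $x_i$, and similarly $\langle y_i \rangle$ and
$\langle z_i \rangle$ are for the nodes $y_i$ and $z_i$, respectively. Furthermore, define $\langle \bar{x}_i \rangle = \bar{\delta}(i)$, $\langle \bar{y}_i \rangle = \bar{\delta}(n + i)$,
and $\langle \bar{z}_i \rangle = \bar{\delta}(2n + i)$.

The node set $X$ is associated with two sequences $\mathcal{X} = \langle x_1 \rangle G \langle x_2 \rangle G \cdots G \langle x_n \rangle$ and $\bar{\mathcal{X}} =
\langle \bar{x}_n \rangle G \langle \bar{x}_{n-1} \rangle G \cdots G \langle \bar{x}_1 \rangle$. Let $\mathcal{X} - x_i = \langle x_1 \rangle G \cdots G \langle x_{i-1} \rangle G \langle x_{i+1} \rangle G \cdots G \langle x_n \rangle$ and $\bar{\mathcal{X}} - \bar{x}_i =
\langle \bar{x}_n \rangle G \cdots G \langle \bar{x}_{i+1} \rangle G \langle \bar{x}_{i-1} \rangle G \cdots G \langle \bar{x}_1 \rangle$, where $x_i$ is any node in $X$. Similarly, the node sets
$Y$ and $Z$ are associated with sequences $\mathcal{Y}$, $\bar{\mathcal{Y}}$, and $\mathcal{Z}$, $\bar{\mathcal{Z}}$, respectively.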

Edge Encoding: For each edge $e_j$ (where $1 \le j \le m$), we define four delimiter sequences,
namely, $V_j = \pi(j)$, $W_j = \pi(m + 1 + j)$, $\bar{V}_j = \bar{\pi}(j)$, and $\bar{W}_j = \bar{\pi}(m + 1 + j)$. Assume that
$e_j = (x_{p_j}, y_{q_j}, z_{r_j})$. Then $e_j$ is encoded by the sequence $S_j$ defined as

$AG \; V_j \; AG \; W_j \; AG \; \mathcal{X} \; G \; \mathcal{Y} \; G \; \mathcal{Z} \; G \; (\bar{\mathcal{Z}} - \bar{z}_{r_j}) \; G \; (\bar{\mathcal{Y}} - \bar{y}_{q_j}) \; G \; (\bar{\mathcal{X}} - \bar{x}_{p_j}) \; \bar{V}_j \; A \; \bar{W}_j$.

Let $S_{m+1}$ be a special sequence defined as $AG \; V_{m+1} \; AG \; W_{m+1} \; AG \; \bar{\mathcal{Z}} \; G \; \bar{\mathcal{Y}} \; G \; \bar{\mathcal{X}} \; \bar{V}_{m+1} \; A \; \bar{W}_{m+1}$.
In the following discussion, each $S_j$ is referred to as a region.

Finally, we define $S_E$ to be the sequence $S_{m+1} S_m \cdots S_1$. Let $\sigma = 3n(3d - 2) + 6d - 1$
and let $h = m\sigma + n(6d - 4) + 12d - 5$. Note that $S_E$ has $O((n + m)^3)$ characters and can
be constructed in $O(|S_E|)$ time. In Sections 5.2 and 5.3, we show that $sp(S_E) \ge h$ if and
only if $E$ contains a perfect matching.

5.2 Correctness of the if-part

This section shows that if $E$ has a perfect matching, we can construct a planar secondary
structure for $S_E$ containing at least $h$ stacking pairs. Therefore, $sp(S_E) \ge h$.

First of all, we establish several basic steps for constructing stacking pairs on $S_E$.

• $\delta(i)$ or $\bar{\delta}(i)$ itself can form $d - 1$ stacking pairs, while $\delta(i)$ and $\bar{\delta}(i)$ together can form
$3d - 2$ stacking pairs.

• $\pi(i)$ and $\bar{\pi}(i)$ together can form $6d - 2$ stacking pairs.

• For any $i \ne j$, $\pi(i)$ and $\bar{\pi}(j)$ together can form $6d - 3$ stacking pairs.

Lemma 5.1 If $E$ has a perfect matching, then $sp(S_E) \ge h$.

Proof. Let $M = \{e_{j_1}, e_{j_2}, \ldots, e_{j_n}\}$ be a perfect matching. Without loss of generality, we
assume that $1 \le j_1 < j_2 < \cdots < j_n \le m$. Define $j_{n+1} = m + 1$. To obtain a planar
secondary structure for $S_E$ with at least $h$ stacking pairs, we consider the regions one by
one. There are three cases.

Case 1: We consider any region $S_j$ such that $e_j \notin M$. Our goal is to show that $\sigma =
3n(3d - 2) + 6d - 1$ stacking pairs can be formed within $S_j$. Note that there are $(m - n)$
edges not in $M$. Thus, we can obtain a total of $(m - n)\sigma$ stacking pairs in this case. Details
are as follows. Assume that $e_j = (x_{p_j}, y_{q_j}, z_{r_j})$.

• $6d - 2$ stacking pairs can be formed between $V_j$ and $\bar{V}_j$, and between $W_j$ and $\bar{W}_j$.

• $3d - 2$ stacking pairs can be formed between $\langle x_i \rangle$ and $\langle \bar{x}_i \rangle$ for all $i \ne p_j$, and between
$\langle y_i \rangle$ and $\langle \bar{y}_i \rangle$ for all $i \ne q_j$, and between $\langle z_i \rangle$ and $\langle \bar{z}_i \rangle$ for all $i \ne r_j$.

• $\langle x_{p_j} \rangle$, $\langle y_{q_j} \rangle$, and $\langle z_{r_j} \rangle$ can each form $d - 1$ stacking pairs.

The total number of stacking pairs that can be formed within $S_j$ is $2(6d - 2) + 3(n - 1)(3d -
2) + 3(d - 1) = 3n(3d - 2) + 6d - 1 = \sigma$.

Case 2: We consider the edges $e_{j_1}, e_{j_2}, \ldots, e_{j_n}$ in $M$. Our goal is to show that each
corresponding region accounts for $\sigma + 6d - 4$ stacking pairs. Thus, we obtain a total of
$n\sigma + n(6d - 4)$ stacking pairs in this case. Details are as follows. Unlike Case 1, each region
$S_{j_k}$, where $1 \le k \le n$, may have some of its bases paired with those of $S_{j_{k+1}}$.

• $6d - 3$ stacking pairs can be formed between $W_{j_k}$ in $S_{j_k}$ and $\bar{W}_{j_{k+1}}$ in $S_{j_{k+1}}$.

• $6d - 2$ stacking pairs can be formed between $V_{j_k}$ in $S_{j_k}$ and $\bar{V}_{j_k}$ in $S_{j_k}$.

• $3d - 2$ stacking pairs can be formed between $\langle x_i \rangle$ in $S_{j_k}$ and $\langle \bar{x}_i \rangle$ in $S_{j_k}$ for any
$i \ne p_{j_1}, \ldots, p_{j_k}$, and between $\langle y_i \rangle$ in $S_{j_k}$ and $\langle \bar{y}_i \rangle$ in $S_{j_k}$ for any $i \ne q_{j_1}, \ldots, q_{j_k}$, and
between $\langle z_i \rangle$ in $S_{j_k}$ and $\langle \bar{z}_i \rangle$ in $S_{j_k}$ for any $i \ne r_{j_1}, \ldots, r_{j_k}$.

• $3d - 2$ stacking pairs can be formed between $\langle x_i \rangle$ in $S_{j_k}$ and $\langle \bar{x}_i \rangle$ in $S_{j_{k+1}}$ for any
$i = p_{j_1}, \ldots, p_{j_k}$, and between $\langle y_i \rangle$ in $S_{j_k}$ and $\langle \bar{y}_i \rangle$ in $S_{j_{k+1}}$ for any $i = q_{j_1}, \ldots, q_{j_k}$, and
between $\langle z_i \rangle$ in $S_{j_k}$ and $\langle \bar{z}_i \rangle$ in $S_{j_{k+1}}$ for any $i = r_{j_1}, \ldots, r_{j_k}$.

The total number of stacking pairs charged to $S_{j_k}$ is $6d - 3 + 6d - 2 + 3n(3d - 2) = \sigma + 6d - 4$.

Case 3: We consider $S_{m+1}$. We can form $6d - 2$ stacking pairs between $V_{m+1}$ and $\bar{V}_{m+1}$,
and $6d - 3$ stacking pairs between $W_{m+1}$ and $\bar{W}_{j_1}$. The number of such stacking pairs is
$12d - 5$.

Combining the three cases, the number of stacking pairs that can be formed on $S_E$ is
$(m - n)\sigma + n(\sigma + 6d - 4) + 12d - 5$, which is exactly $h$. Notice that no two stacking pairs
formed cross each other. Thus, $sp(S_E) \ge h$. □



5.3 Correctness of the only-if part

This section shows that if $E$ has no perfect matching, then $sp(S_E) < h$. We first give the
framework of the proof in Section 5.3.1. Then, some basic definitions and concepts are
presented in Section 5.3.2. The proof of the only-if part is given in Section 5.3.3.

5.3.1 Framework of the proof

Let OPT be a secondary structure of $S_E$ with the maximum number of stacking pairs. Let
#OPT be the number of stacking pairs in OPT. That is, #OPT $= sp(S_E)$. In this section,
we will establish an upper bound for #OPT. Recall that we only consider Watson-Crick
base pairs, i.e., A-U and C-G pairs. We define a conjugate of a substring in $S_E$ as
follows.

Conjugates: For every substring $R = s_1 s_2 \ldots s_k$ of $S_E$, the conjugate of $R$ is $\bar{R} = \bar{s}_k \ldots \bar{s}_1$,
where $\bar{A} = U$, $\bar{U} = A$, $\bar{C} = G$, and $\bar{G} = C$.

For example, $AA$'s conjugate is $UU$ and $UA$'s conjugate is $UA$. To form a stacking pair,
two adjacent bases must be paired with another two adjacent bases. So, we concentrate on
the possible patterns of adjacent bases in $S_E$.

2-substrings: In $S_E$, any two adjacent characters are referred to as a 2-substring. By
construction, $S_E$ has only ten different types of 2-substrings: $UU$, $AA$, $UA$, $GG$, $CC$, $GC$,
$AG$, $GA$, $GU$, and $CA$-substrings. A 2-substring can only form a stacking pair with its
conjugate. If they actually form a stacking pair in OPT, they are said to be paired.
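
The conjugate operation and the census of 2-substrings are easy to express in code. The following minimal sketch uses hypothetical helper names `conjugate` and `two_substrings` and works on any string over the alphabet A, U, G, C.

```python
from collections import Counter

COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def conjugate(r):
    """Reverse the string and complement every base."""
    return "".join(COMPLEMENT[b] for b in reversed(r))


def two_substrings(seq):
    """Count the occurrences of every pair of adjacent characters in seq."""
    return Counter(seq[p:p + 2] for p in range(len(seq) - 1))
```

With these helpers, a 2-substring `t` can take part in a stacking pair only if `conjugate(t)` also occurs in the sequence, which is the observation used to discard the types whose conjugates are absent.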

Since the conjugates of $AG$, $GA$, $GU$, and $CA$-substrings do not exist in $S_E$, there is
no stacking pair in $S_E$ which involves these 2-substrings. We only need to consider $AA$,
$UU$, $UA$, $GG$, $CC$, $GC$-substrings. Table 1 shows the numbers of occurrences of these
2-substrings in $S_j$ ($1 \le j \le m + 1$) and the total occurrences of these substrings in $S_E$.

Table 1: Number of occurrences of different 2-substrings

Let #$AA$ denote the number of occurrences of $AA$-substrings in $S_E$. We use the # notation
for other types of 2-substrings in $S_E$ similarly. The following fact gives a straightforward
upper bound for #OPT.

Fact 5.2 #OPT $\le \min\{\#AA, \#UU\} + \min\{\#GG, \#CC\} + \#UA/2 + \#GC/2$
$= h + n + 1 + (2m + 2)$.

Note that OPT may not pair all $AA$-substrings with $UU$-substrings. Let $\diamond AA$ be the
number of $AA$-substrings that are not paired in OPT. Again, we use the $\diamond$ notation for
other types of 2-substrings. Fact 5.2 can be strengthened as follows.



Fact 5.3 #OPT < min{#^^-0^^, #[/[/-<>[/[/} + min{#GG-0GG,#CC7-^CC} + 
{#UA - <)UA)/2 + (#GC - <)GC)/2. 



The upper bound given in Fact 5.3 forms the basis of our proof for showing that $\#OPT < h$.
In the following sections, we consider the possible structure of OPT. For each possible
case, we show that the lower bounds for some $\Diamond$ values, such as $\Diamond AA$ and $\Diamond CC$, are
sufficiently large so that #OPT can be shown to be less than $h$. In particular, in one of
the cases, we must make use of the fact that $E$ does not have a perfect matching in order
to prove the lower bound for $\Diamond AA$, $\Diamond UA$, and $\Diamond UU$. We give some basic definitions and
concepts in Section 5.3.2. The lower bounds and the proof are given in Section 5.3.3.



5.3.2 Definitions and concepts 

In this section, we give some definitions and concepts which are useful in deriving lower
bounds for $\Diamond$ values. We first classify each region $S_j$ in $S_e$ as either open or closed with
respect to OPT. Then, extending the definitions of fragments and conjugates, we introduce 
conjugate fragments and delimiter fragments. Finally, we present a property of delimiter 
fragments in open regions. 



Open and closed regions: With respect to OPT, a region $S_j$ in $S_e$ is said to be an open
region if some UU, AA, or UA-substrings in $S_j$ are paired with some 2-substrings outside
$S_j$; otherwise, it is a closed region.

Lemma 5.4 If $S_{m+1}$ is a closed region, then $\#OPT < h$.

Proof. $S_{m+1}$ has $3nd$ more AA-substrings than UU-substrings. If $S_{m+1}$ is a closed
region, these $3nd$ AA-substrings are not paired by OPT. Thus, $\Diamond AA \ge 3nd$. By Fact 5.3,
$\#OPT \le h + (n + 1) + (2m + 2) - 3nd < h$. □

Recall that $S_e$ is a sequence composed of $\delta$'s, $\overline{\delta}$'s, $\pi$'s, and $\overline{\pi}$'s. Each $\delta(k)$ (respectively
$\overline{\delta}(k)$) consists of two substrings of the form $U^+A^+$; each of these substrings is called a
fragment. Furthermore, each $\pi(k)$ (respectively $\overline{\pi}(k)$) consists of two substrings of the form $C^+$
(respectively $G^+$); each of these substrings is also called a fragment.



Conjugate fragments and delimiter fragments: Consider any fragment $F$ in $S_e$.
Another fragment $F'$ in $S_e$ is called a conjugate fragment of $F$ if $F'$ is the conjugate of $F$.
Note that if $F$ is a fragment of a certain $\delta(k)$ (respectively $\pi(k)$), then $F'$ appears only in some $\overline{\delta}(k)$
(respectively $\overline{\pi}(k)$), and vice versa. By construction, if $F$ is a fragment of some delimiter
sequence $V_j$ or $W_j$, then $F$ has a unique conjugate fragment in $S_e$, which is located in $V_j$
or $W_j$, respectively. However, if $F$ is a fragment of some non-delimiter sequence, say $(x_j)$,
then for every instance of $\overline{(x_j)}$ in $S_e$, $F$ has one conjugate fragment in $\overline{(x_j)}$.

A fragment F is said to be paired with its conjugate fragment F' by OPT if OPT 
includes all the pairs of bases between F and F' . 

For $1 \le j \le m + 1$, the fragment $F$ in $V_j$ or $W_j$ is called a delimiter fragment. Note
that the delimiter fragment $F$ should be of the form $C^{2d+k}$ for $2d > k > 0$.

The following lemma shows a property of delimiter fragments in open regions. 

Lemma 5.5 If $S_j$ is an open region, then both delimiter fragments of either $V_j$ or $W_j$ must
not pair with their conjugate fragments in OPT.

Proof. We prove the statement by contradiction. Suppose one fragment of $V_j$ and one
fragment of $W_j$ are paired with their conjugate fragments. Let $(s_x, s_{x+1}; s_{y-1}, s_y)$ and
$(s_{x'}, s_{x'+1}; s_{y'-1}, s_{y'})$ be some particular stacking pairs in $V_j$ and $W_j$, respectively. Since
$S_j$ is an open region, we can identify a stacking pair $(s_{x''}, s_{x''+1}; s_{y''-1}, s_{y''})$ where $s_{x''}s_{x''+1}$
and $s_{y''-1}s_{y''}$ are 2-substrings within and outside $S_j$, respectively. Note that these three
stacking pairs form an interleaving block. By Lemma 2.1, OPT is not planar, reaching a
contradiction. □



5.3.3 Proof of the only-if part 

By Lemma 5.4, it suffices to assume that $S_{m+1}$ is an open region. Before we give the proof
of the only-if part, let us consider the following lemma.

Lemma 5.6 Let $\alpha$ be the number of delimiter fragments that are not paired with their
conjugate fragments. Then, $\Diamond CC + \Diamond GG \ge \alpha + (\#GC - \Diamond GC)$.

Proof. By construction, a GC-substring must be next to the left end of a delimiter fragment
$F$, which is of the form $C^+$. No other GC-substrings can exist. If this GC-substring is
paired, the leftmost CC-substring of $F$ must not be paired as there is no GGC pattern
in $S_e$. Thus, $F$ must be one of the $\alpha$ delimiter fragments that are not paired with their
conjugate fragments. Based on this observation, we classify the $\alpha$ delimiter fragments into
two groups: (1) the $\#GC - \Diamond GC$ delimiter fragments whose GC-substrings at the left end
are paired; and (2) the $\alpha - (\#GC - \Diamond GC)$ delimiter fragments whose GC-substrings at the
left end are not paired.

For each delimiter fragment $F = C^{2d+k}$ in group (1), since the GC-substring on the
left of $F$ is paired, the leftmost CC-substring of $F$ must not be paired by OPT. For the
remaining $2d + k - 2$ CC-substrings, we either find a CC-substring which is not paired by
OPT; or these $2d + k - 2$ CC-substrings are paired to GG-substrings in some fragment
$F' = G^{2d+k'}$ with $2d > k' > k$, and thus, some GG-substring of $F'$ is not paired. Therefore,
each delimiter fragment in group (1) introduces either (i) two unpaired CC-substrings or
(ii) one unpaired CC-substring and one unpaired GG-substring. Hence, the total number of
unpaired CC and GG-substrings due to delimiter fragments in group (1) is at least $2(\#GC - \Diamond GC)$.

For each delimiter fragment $F = C^{2d+k}$ in group (2), consider the CC-substrings in
$F$. With a similar argument, we can show that each delimiter fragment in group (2)
introduces either (i) one unpaired CC-substring or (ii) one unpaired GG-substring. Hence,
the total number of unpaired CC and GG-substrings due to delimiter fragments in group
(2) is at least $\alpha - (\#GC - \Diamond GC)$.

In total, we have $\Diamond CC + \Diamond GG \ge \alpha + (\#GC - \Diamond GC)$. □
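Adding the contributions of the two groups makes the final count explicit. This only restates the argument above, writing $\alpha$ for the number of unpaired delimiter fragments of Lemma 5.6:

```latex
\begin{align*}
\Diamond CC + \Diamond GG
  &\ge \underbrace{2(\#GC - \Diamond GC)}_{\text{group (1)}}
     + \underbrace{\alpha - (\#GC - \Diamond GC)}_{\text{group (2)}} \\
  &= \alpha + (\#GC - \Diamond GC).
\end{align*}
```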

Now, we state a lemma which shows the lower bounds for some $\Diamond$ values in terms of
the number of open regions in OPT.

Lemma 5.7 Let $\ell \ge 1$ be the number of open regions in OPT.

(1) If $S_{m+1}$ is an open region, then $\Diamond UU \ge 3(m + 1 - \ell)d$.

(2) $\max\{\Diamond CC, \Diamond GG\} \ge \ell + (\#GC - \Diamond GC)/2$.

(3) If $\ell = n + 1$, $S_{m+1}$ is an open region, and $E$ does not have a perfect matching, then
either (a) $\Diamond UU \ge 3(m - n)d + 1$, (b) $\Diamond AA \ge 1$, or (c) $\Diamond UA \ge 2$.

Proof. 

Statement 1. Within each closed region $S_j$ where $j \ne m + 1$, $3d$ UU-substrings cannot be
paired in OPT. As there are $m + 1 - \ell$ such closed regions, $3(m + 1 - \ell)d$ UU-substrings
are not paired in OPT. Thus, $\Diamond UU \ge 3(m + 1 - \ell)d$.

Statement 2. By Lemma 5.5, we can identify $2\ell$ fragments in $V_j$ and $W_j$ of all open
regions which are not paired with their conjugate fragments. Then, by Lemma 5.6, we have
$\Diamond CC + \Diamond GG \ge 2\ell + (\#GC - \Diamond GC)$. Thus, $\max\{\Diamond CC, \Diamond GG\} \ge \ell + (\#GC - \Diamond GC)/2$.

Statement 3. By an argument similar to the proof of Statement 1, within the $m + 1 - \ell = m - n$
closed regions, $3(m - n)d$ UU-substrings are not paired in OPT.

For the $\ell = n + 1$ open regions, one of them must be $S_{m+1}$. Let $S_{j_1}, \ldots, S_{j_n}$ be the
remaining $n$ open regions. Recall that $e_{j_1}, \ldots, e_{j_n}$ are the corresponding edges of these $n$
open regions. Since these $n$ edges cannot form a perfect matching, some node, say $x_k$, is
adjacent to more than one of these $n$ edges. Thus, within $S_{j_1}, \ldots, S_{j_n}, S_{m+1}$, we have more
$(x_k)$ than $\overline{(x_k)}$. Therefore, at least two of the fragments in all $(x_k)$ are not paired with their
conjugate fragments.

Let $F$ be one of such fragments. Note that $F$ is of the form $U^+A^+$. Since $F$ is not paired
with its conjugate fragment, one of the following three cases occurs in OPT:

Case 1: A UU-substring of $F$ is not paired.

Case 2: An AA-substring of $F$ is not paired.

Case 3: All UU-substrings and AA-substrings of $F$ are paired. In this case, the $U^+$ of $F$ is paired
with the $A^+$ of a fragment $F' = U^+A^+$, and the $A^+$ of $F$ is paired with some substring $U^+$ of some
fragment $F''$. As $F'$ and $F''$ are not the same fragment, the UA-substrings of both $F$ and
$F'$ are not paired.

In summary, we have either (a) $\Diamond UU \ge 3(m - n)d + 1$, (b) $\Diamond AA \ge 1$, or (c) $\Diamond UA \ge 2$.

□



Based on Lemma 5.7, we prove the only-if part by a case analysis in the following lemma. 



Lemma 5.8 If $E$ does not have a perfect matching, then $\#OPT < h$.

Proof. Recall that if $S_{m+1}$ is a closed region, then $\#OPT < h$. Now, suppose that $S_{m+1}$
is an open region. We show $\#OPT < h$ in three cases: $\ell < n + 1$, $\ell > n + 1$, and $\ell = n + 1$.

Case 1: $\ell < n + 1$. By Lemma 5.7 (1), $\Diamond UU \ge 3(m + 1 - \ell)d$. By Fact 5.3, we can conclude
that $\#OPT \le h + n + 1 + (2m + 2) - 3(n + 1 - \ell)d \le h + n + 1 + (2m + 2) - 3d < h$.

Case 2: $\ell > n + 1$. By Lemma 5.7 (2), $\max\{\Diamond CC, \Diamond GG\} \ge \ell + (\#GC - \Diamond GC)/2$. By
Fact 5.3, $\#OPT \le h + n + 1 - \ell$, which is smaller than $h$ because $\ell > n + 1$.

Case 3: $\ell = n + 1$. By Lemma 5.7 (3), either (a) $\Diamond UU \ge 3(m - n)d + 1$, or (b) $\Diamond AA \ge 1$,
or (c) $\Diamond UA \ge 2$. By Fact 5.3, $\#OPT \le h + n - \max\{\Diamond CC, \Diamond GG\} + (\#GC - \Diamond GC)/2$.
By Lemma 5.7 (2), we have $\#OPT < h$. □
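The last step of Case 3 combines the bound obtained from Fact 5.3 with Lemma 5.7 (2) at $\ell = n + 1$. Spelled out, as a restatement that introduces no new assumption:

```latex
\begin{align*}
\#OPT &\le h + n - \max\{\Diamond CC, \Diamond GG\} + (\#GC - \Diamond GC)/2 \\
      &\le h + n - \bigl[(n + 1) + (\#GC - \Diamond GC)/2\bigr] + (\#GC - \Diamond GC)/2 \\
      &= h - 1 < h.
\end{align*}
```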

We conclude that if $E$ does not have a perfect matching, then $\#OPT < h$. Equivalently,
if $\#OPT \ge h$, then $E$ has a perfect matching.



6 Conclusions 

In this paper, we have studied the problem of predicting RNA secondary structures that 
allow arbitrary pseudoknots with a simple free energy function that is minimized when 
the number of stacking pairs is maximized. We have proved that this problem is NP-hard 
if the secondary structure is required to be planar. We conjecture that the problem is 
also NP-hard for the general case. We have also given two approximation algorithms for 
this problem with worst-case approximation ratios of 1/2 and 1/3 for planar and general 
secondary structures, respectively. It would be of interest to improve these approximation 
ratios. 

Another direction is to study the problem using an energy function that is minimized
when the number of base pairs is maximized. It is known that this problem can be solved 
in cubic time if the secondary structure can be non-planar ||^] . However, the computational 
complexity of the problem is still open if the secondary structure is required to be planar. 
We conjecture that the problem becomes NP-hard under this additional condition. We
would like to point out that the observation that has enabled us to visualize the planarity
of stacking pairs on a rectangular grid does not hold in the case of maximizing base pairs.